NHN Cloud title extraction fix #18
No actionable comments were generated in the recent review.
Walkthrough
The PR adds title validation and cleaning logic to the sitemap crawler to prevent URL-like strings from being stored as article titles. New helper functions detect and filter URL-like titles, with NHN Cloud-specific normalization. Both NHN Cloud API extraction and general HTML parsing now raise
Changes
Article Title Validation and Cleaning for Sitemap Crawler
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Approved scope
Fixes title extraction so that URL-like values in the collected NHN Cloud Meetup data are not stored as the article title.
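A minimal sketch of the kind of title check described here, not the code from this PR. The function names (`is_url_like_title`, `dedupe_suffix`, `clean_title`), the regular expressions, and the ` | ` separator are hypothetical, and `SkippedArticleError` is a stand-in for the project's exception of the same name. It assumes a candidate is rejected when it starts with http/https or looks like domain/path, and that a repeated trailing NHN Cloud Meetup suffix is collapsed to one.

```python
import re

NHN_SUFFIX = "NHN Cloud Meetup"  # suffix text named in the PR description


class SkippedArticleError(Exception):
    """Stand-in for the project's exception; its real definition is not shown here."""


# Hypothetical patterns: an http/https scheme, or a bare "domain.tld/path" string.
_SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)
_DOMAIN_PATH_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+/\S*$", re.IGNORECASE)


def is_url_like_title(title: str) -> bool:
    """Return True when the candidate looks like a URL rather than a title."""
    candidate = title.strip()
    return bool(_SCHEME_RE.match(candidate) or _DOMAIN_PATH_RE.match(candidate))


def dedupe_suffix(title: str, suffix: str = NHN_SUFFIX, sep: str = " | ") -> str:
    """Collapse repeated trailing suffixes to one (the separator is an assumption)."""
    tail = sep + suffix
    had_suffix = title.endswith(tail)
    while title.endswith(tail):
        title = title[: -len(tail)]
    return title + tail if had_suffix else title


def clean_title(raw_title: str) -> str:
    """Normalize whitespace, reject unusable candidates, then dedupe the suffix."""
    title = " ".join((raw_title or "").split())
    if not title or is_url_like_title(title):
        raise SkippedArticleError(f"unusable title candidate: {raw_title!r}")
    return dedupe_suffix(title)
```

Raising instead of returning an empty string lets the caller skip the article explicitly rather than store a placeholder title.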
Changes
- Normalize postPerLang.title and use it as the title.
- Handle the NHN Cloud Meetup suffix so that it is not duplicated.
- Reject title candidates that are http/https URLs or have a domain/path form.
Intentionally excluded
Verification
- uv run pytest tests/test_sitemap_crawler.py -> 4 passed
- uv run ruff check app/crawler/sitemap.py tests/conftest.py tests/test_sitemap_crawler.py -> All checks passed
- Confirmed that the count of URL-like titles for the nhn-cloud-meetup source is 0.
How a human can verify
- Change into the backend directory with cd apps/backend.
- Check the tests for cases that are handled as SkippedArticleError.
Summary by CodeRabbit